Temporal response characterization across individual multiomics profiles of prediabetic and diabetic subjects

Longitudinal deep multiomics profiling, which combines biomolecular, physiological, environmental and clinical measures data, shows great promise for precision health. However, integrating and understanding the complexity of such data remains a big challenge. Here we utilize an individual-focused bottom-up approach aimed at first assessing single individuals’ multiomics time series, and using the individual-level responses to assess multi-individual grouping based directly on similarity of their longitudinal deep multiomics profiles. We used this individual-focused approach to analyze profiles from a study profiling longitudinal responses in type 2 diabetes mellitus. After generating periodograms for individual subject omics signals, we constructed within-person omics networks and analyzed personal-level immune changes. The results identified both individual-level responses to immune perturbation, and the clusters of individuals that have similar behaviors in immune response and which were associated to measures of their diabetic status.

The development of novel technologies in personal health monitoring devices, high throughput sequencing and computational methods has generated massive omics data, and provides both a great opportunity and challenge to precision health [1][2][3][4][5] . The big data provides plentiful health information ranging from biomolecular, physiological, and environment data to clinical measures. This information helps identify potential deviations from a healthy baseline and improves health risk predictions 1 . A big challenge of a big data approach to precision health is how to integrate and understand these multi-dimensional, extremely diverse sources, with highly heterogeneous data 3 . Early efforts by Chen, Mias, Li-Pook-Than et al. focused on assessing the feasibility of integrated Personal Omics Profiling (iPOP), by utilizing a multiomics integration framework to interpret healthy and diseased states followed through an individual's blood-based multiomics assessment 6 . More recent efforts by Sara Ahadi et al. revealed personal aging markers by using deep longitudinal profiling 7 , Abdellah Tebani et al. discovered how the personal cohort changes during the wellness period 8 . Environmental effects have also been studied by M. Reza Sailani et al., revealing two biological seasonal patterns in California by multiomics profiling 9 . Wearable sensors have also been used in digitalized health in tracking physiomes and activity 10 . Other implementations have used multiomics to monitor the drug responses 11 . Non-invasive longitudinal saliva multiomics have been recently used by Mias et al. to monitor immune responses in a vaccinated individual 12 . Although these efforts have shown the great promise of deep multiomics profiling, the complexity of data presents limitations for practical implementations. Deep multiomics data come from diverse sources, and have different types, sizes and ranges, which complicates comparisons between different individuals' personal multiomics. In the Pioneer study by Price et al. 13 dynamic data clouds were used for longitudinal monitoring of individual subjects, in a study that also incorporated behavioral coaching to improve clinical biomarkers. In recent work, Zhou et al. 14 carried out iPOP across multiple individuals, and built correlation networks of molecular associations. However, to the best of our knowledge, direct networks of individuals associated with longitudinal individual deep multiomics profiles have not been constructed.
In this investigation we took an individual-focused approach to categorize personal longitudinal deep multiomics profiles and group individuals into communities, using spectral representations of individual multiomics time series. In taking this individual-focused approach, one of our goals was to perform an analysis closer to clinical applications, where inherently the individual is monitored over time to enable a personal diagnosis. We Methods Summary of cohort details and data. The original data used in this analysis comes from the study by Zhou et al. 14 , that focused on multiomics characterization of host-microbe dynamics in prediabetics. The measures, SSPG, Matsuda, DI and isrMax, came from the other paper of the same project by Shussler-Fiorenza Rose et al. 4 . All data obtained were made publicly available by the original authors 4,14 as described therein (under Stanford IRB No. 23602), and no additional institutional review board approvals (IRB) were required for this investigation.
The participants had been classified as diabetic or prediabetic in the original study according to one of the following three criteria 14 : (i) haemoglobin A1C (A1C; diabetic ≥ 6.5% > prediabetic ≥ 5.7% ), (ii) fasting glucose (diabetic ≥ 126 mg dl −1 > prediabetic ≥ 100 mg dl −1 ) and (iii) based on an annual oral glucose tolerance test (OGTT; diabetic ≥ 200 mg dl −1 > prediabetic ≥ 140 mg dl −1 at 2 h). The different subjects in the study had highly heterogeneous visit records: some subjects only had one visit record but one subject has more than 150 visit records (time points), which is about 15 times more than the average of 10 visits per subject. Multiple omics were generated from the subjects including blood based transcriptomics, microbiome data (nares/ gut), cytokine measurements and clinical measures. To ensure the individual omics profiles had enough time points and time series, we filtered records from individuals so that the number of time points N t >= 4 and the number of omics N o > 500 , across participants from the data source 14 . We also excluded the subject with the 150+ visits records, as this was not comparable directly to other subjects in our analysis, given the density of points. Our final filtered dataset contained 69 subjects from the original data source. Figure 1a shows the age distribution across sexes for the subjects. We had 34 males and 35 females, and most subjects were older than 50. Summary distributions of the subjects' observation window are shown in Fig. 1b Fig. 1d. As RNA-seq provides a comprehensive and accessible map of the transcriptome, with more omics profiled (by number comparison) compared to other-omes (e.g. proteomes/metabolomes, etc.) we expect that the majority of future omics profiling studies data will follow similar trends (as has been the case so far), though we do expect more microbiome data to emerge, as the host-microbiome interaction is still under considerable investigation.

Data preprocessing.
To obtain an individual's omics profile, we combined all the omics source dataset into a dataframe, then separated into dataframes for each individual. Since our workflow has two branches: single subject analysis and multi-subject similarity analysis, as seen in Fig. 2, the following data preprocessing was carried out for the two branches: (i) For the single subject analysis, we selected the signals with less than 25% time points missing from each individual's dataframe as the input for single subject analysis, using each individual's time points as possible measurement points. (ii) For multi-subject analysis, we sorted each individual's time frame from their dataframes, then combined all individual time frames to get all the possible time points as the common time frame. We then calculated Lomb-Scargle periodograms from each individual's dataframe using this common time frame as the set of possible measurement points. The transformed dataframe was then used as the input for the multi-subject analysis.

Individual subject analysis. Time series categorization. Individual subject analysis was carried out in
Python, with the package PyIOmica utilities for time series categorization 15 (command calculateTime-SeriesCategorization). Briefly, for each subject s, for each omics i its time series X was analyzed over its constituent timepoints. The individual omics intensities at timepoint j were compared to the initial timepoint by subtracting its intensity from all values, X is (t j ) = X is (t j ) − X is (0) . The data were then normalized to a time series Q, using the Euclidean norm, so that The algorithm's classification of signals into trends uses spectral methods, as previously described 12,15 . Briefly, for each signal a Lomb-Scargle periodogram is calculated as a list P is . The inverse Fourier transform of P is returns a list of autocorrelations, {ρ isk } , where k ∈ {0, . . . , N/2} is the lag. In parallel with the original time series signals, a bootstrap set of 10 5 time series was generated by sampling from the original data with replacement. The autocorrelations at different lags of the bootstrap set were computed to generate an autocorrelation null distribution for each lag from which a set of cutoffs {ρ ck } were obtained corresponding to a 0.95 quantile. A time series was then assigned to a class labeled with the lowest Lag l for which the series' autocorrelation ρ isl is larger than the cutoff, i.e., where l = Min j : ρ isj ≥ ρ cj , and j ∈ 1, . . . , k.
If a time series X is does not have autocorrelations that satisfy the cutoff criteria, the algorithm then checked if the series has a pronounced peak or trough at any time point. The time series' maximum, max is = maxX is , and minimum, min is = minX is were compared to {max cn , min cn } , which are maxima and minima cutoffs from www.nature.com/scientificreports/ null distributions again computed using the bootstrap time series for all possible time series lengths n. X is is then labeled as a SpikeMax signal if max is > max cn , or as a SpikeMin if min is < min cn . A time series that did not meet any of the cutoff criteria was not labeled as having a statistically significant trend. The approach categorizes each omics signal based on its own trend, and is thus not affected by differences in the omics modality, making it directly extensible to incorporating different kinds of time series data 16 .
Clustering and heatmaps. After classification, and using PyIOmica's clusterTimeSeriesCategorization function, we carried out a two-tier hierarchical clustering (agglomerative; complete linkage) for each temporal class, to identify groups (G) and sub-groups (S). The clustering grouping used a similarity based on {ρ isk } (for the autocorrelation classification) and {Q is } for the second tier. Groups and subgroups were determined using a silhouette algorithm 17 . The results were visualized for each subject and every temporal class identified using PyIOmica's visualizeTimeSeriesCategorization. Example outputs are shown in Fig. 3 and included in the Online Data Files (ODFs). The first tier of clustering aims at capturing the autocorrelation structure, and hence the dominant pattern in the data. The second tier clustering based on the data values can distinguish pattern variations, and particularly sign differences that the autocorrelations (being the inverse Fourier transform of the periodogram frequencies) would not capture 16 .
Reactome enrichment analysis. Reactome 18 pathway enrichment analysis was carried out for each Group/Subgroup and each subject using the Reactome application programming interface (API) in PyIOmica. Examples are shown in Table 1 for two subjects, and complete output for all subjects is included in the ODFs.

Across subject comparisons.
Individual results aggregation. The individual subject omics that showed statistically significant trends were aggregated to identify consistency across individuals. Signals identified as having statistically significant trends, FDR < 0.05 , in more than 50% of the individuals, are shown in Table 2. www.nature.com/scientificreports/ Network construction. The network analysis was carried out in Python, using networkx 19 and scikitnetwork 20 . First the time series periodograms for all the omics time series were computed for all subjects using the LombScargle function in PyIOmica. Next, for pairs of subjects p, q and for each omics time series i, a Spearman correlation matrix S i was constructed. In parallel, a bootstrap simulation of 50,000 time series was also generated from the data, as a null distribution, and the pairwise Spearman correlations were computed for these as well to determine a Spearman correlation cutoff for significance, s c at the 0.99 quantile level. Entries were kept that were most correlated to each other, by creating a restricted distance matrix R i , such that A weighted network was then constructed, with the subjects represented as nodes, with an adjacency matrix A, constructed as A = i R i . The entries p, q of the adjacency matrix represent the connections in the network. A nonzero entry A p,q means there is an edge connecting nodes (subject) p and q. The magnitude of A p,q provides a weight of the edge. In summary, the edges connecting pairs of nodes (subjects) were added to the network if there was at least one omics for which the Spearman correlation between the subjects was greater than s c , and the edges were weighted by the number of omics that met this criterion for each pair of subjects. Thus, the network captures the similarity between subjects, putting more weight on edges connecting subjects with a higher number of similar omics signals.

Network communities calculation.
To determine the network community structure, a k-means approach was used. The computation used scikit-network's clustering.kmeans, applying an embedding, and utilizing singular value decomposition with dimension one. The number of communities (4) was selected based on the elbow and silhouette methods 17 , and the sklearn.metrics.silhouette_score 21 was used for the silhouette scores calculation.
Mann-Whitney U tests. Subject measurements were compared between members of the communities in the network calculations (see above). We used non-parametric Mann-Whitney U tests 22 , to test for statistically significant pairwise differences across communities (p value < 0.05). The results shown in Table 3 were computed using the scipy.stats.mannwhitneyu Python functionality 23 .

Results
Single subject analysis. We used PyIOmica's 15   www.nature.com/scientificreports/ Once sets of omics that show similar profiles in time are identified, we can assess the biological significance of these temporal associations. Following classification, we carried out Reactome pathway 18 enrichment analysis for the genes that showed statistically significant trends for each subject. The statistically significant (False Discovery Rate, FDR < 0.05) pathways results for subject ZKFV71L and ZTMFN3O for the autocorrelation Multi-subject similarity analysis. Based on the individual results, we first aggregated the omics that showed statistically significant trends in each individual to identify the signals that are consistent across the www.nature.com/scientificreports/ majority of individuals ( > 50 % of subjects, number of occurrences ≥ 35 ), included in Table 2. We found that high frequency signals came from 3 data sources: cytokines, nares and gut microbiome. Many studies have now shown that cytokines have a profound relationship with type 2 diabetes 25-27 , our findings are consistent with these previous works and provide potential biomakers for type 2 diabetes (see also Discussion).
To further investigate common responses across individuals, we created a multi-subject similarity network. The network was constructed by comparing the Spearman correlation of the spectral time series representation (periodogram) across common omics between every pair of subjects. The network, shown in Fig. 4a, has nodes corresponding to the 69 subjects, with weighted edges corresponding to the number of omics that showed similar temporal behavior. We used a k-means algorithm to calculate communities of the network (see "Methods" section for details). Four communities were identified, denoted as Community 0, Community 1, Community 2 and Community 3. Community 0 has 28 individuals, 15 Females and 13 Males, ages from 29 to 69, with disease status including 19 Prediabetic, 1 Diabetic, 5 Crossover and 3 Control. Community 1 has 16 individuals, 6 Females and 10 Males, ages from 33 to 75, 12 Prediabetic, 1 Crossover and 3 Control. Community 2 has 15 individuals, 9 Females and 6 Males, age from 39 to 67, 13 Prediabetic, 2 Crossover. Community 3 has 10 individuals, 5 Females and 5 Males, age from 39 to 70, 6 Prediabetic, 2 Crossover and 2 Control. We then compared clinical measures between the subjects in the community, including body mass index (BMI), disposition index (DI), steady-state plasma glucose (SSPG), Matsuda index (Matsuda), and maximum insulin secretion rate (isrMax). The violin plots show group separated by community and sex, Fig. 4b-f. These 5 distribution figures qualitatively indicate that the Community 0, Community 1 and Community 2 have similar distribution but have large differences with Community 3. We also found that the female and male subjects have different distributions, even within the same community for BMI, DI and isrMax measures. We carried out non-parametric Mann-Whitney U tests 22,28 to compare across the different communities for statistical significance (p value < 0.05), with the results summarized in Table 3. The 5 measures do not show statistical significant differences between Community 0 and Community 1,Community 0 and Community 2, and Community 1 and Community 2. The BMI and SSPG distributions between Community 0 and Community 3, and between Community 2 and Community 3 have statistically significant differences, the BMI distributions between Community 1 and Community 3 also have statistically significant differences, indicating that the subjects in Community 3 have statistical difference in physiological states comparing with Community 0, Community 1 and Community 2. Differences in Male vs. Females in the comparisons, particularly for BMI, DI and isrMax, suggest that even though overall subjects in these three communities may have similar physiological states and responses, females and males still display different physiological states (though the low number of subjects is affecting power in further breaking down of community differences).
We next ranked the omics in each community (represented by the weighted edges), by their frequencies of occurrence. We then carried out Reactome Pathway Enrichment analysis for the top 25% ranked genes of each community (to reduce noise effects from low frequency genes). The statistically significant pathway results (FDR < 0.05) are shown in Table 4. Community 0 has 16 statistical significant Reactome pathways, related to immune response. For Community 2 the statistically significant Reactome pathways include TRKA activation by NGF, DDX58/IFIH1-mediated induction of interferon-alpha/beta, Activation of TRKA receptors, and Ca2+ activated K+ channels. Community 1 and Community 3 did not have statistically significant Reactome results. www.nature.com/scientificreports/  www.nature.com/scientificreports/ The Reactome pathway Enrichment analysis, along with the subjects' BMI, DI, SSPG, Matsuda and isrMax distributions differences indicate that these affect temporal immune responses in different detected Communities, and also have sex differences (BMI and SSPG) between two of the identified communities (Community 0 and Community 3).
Result summary. In summary, our results separate individuals with different physiological conditions and immune response based on the multiomics time series similarity of individual profiling analyses. We can detect changes in each individual compared to their own baseline that coincide with immune responses in each subject, and correspond to the individual time of onset per individual. The groups of genes discovered are associated with immune-relevant pathways, including antigen processing and presentation, interferon signaling and interleukin signaling (e.g., Table 1). The trends detected are consistently detected across multiple individuals (Table 2), and include multiple cytokines (IP10, IL13, PDGFBB, TGFA, IL27, TGFB, MCP1, GCSF, MIG, IL22, CD40L, VEGF, SCF and EGF) as well as bacterial signals (with top results by occurrence in the highest number of individuals including genus Streptococcus in both nares and gut and Blautia in gut, classes Gammaproteobacteria and Betaproteobacteria in nares, family Streptococcaceae in both nares and gut, and phyla Proteobacteria in gut and Bacteroidetes in nares, and order Lactobacillales in gut). Finally, a network with nodes representing subjects and weighted edges omics with similar temporal behavior across individuals, shown in Fig. 4, separated the individuals in 4 communities, with statistically significant differences (p value < 0.05) detected in BMI, SSPG, DI and isrMax, including sex differences, Table 3. The network communities detected involve different pathways for the top genes (highest weight edges) with similar behavior between subjects, Table 4. This involves in one group of subjects (Community 0) several immune responses (antigen processing and presentation, interferon and cytokine signaling, and several pathways also identified in SARS-CoV2-2 immune responses, as multiple immune-related genes are involved), and different pathways in a second group (Community 2, including TRKA activation by NGF, DDX58/IFIH1-mediated induction of interferon-alpha/beta, Activation of TRKA receptors and Ca2+ activated K+ channels).

Discussion
In this manuscript, we applied spectral methods to analyze multiomics individual profiles from public data for 69 individuals. Our goals were to take an individual-focused approach and: (i) detect molecular-level deviations from each individual's own baseline in response to dynamic changes in their physiological states, and (ii) build on the individual results to compare across subjects in a bottom-up approach and identify common molecular signatures. We generated periodograms for individual subject omics time series categorization, constructed within-person omics networks and detected personal-level immune changes corresponding to the individual's physiological state changes. We identified similar individual-level responses to immune perturbation across multiple subjects. We then used the periodograms across subjects to identify network clusters of individuals with similarities across their common omics temporal patterns. The multi-individuals' similarity network revealed different communities within which the molecular behavior was linked to phenotypic differences, including body mass index and insulin resistance, with the immune response dominating differences attributed to diabetic status, Table 3.
Our results are consistent with and extend previous research that has reported several cytokines with important roles in type 2 diabetes development, including cytokines from our findings across individuals in Table 2, with some examples highlighted below. Elevated concentrations of IP10 (CXCL10) have been reported in type 2 diabetes, and are associated with higher diabetes risk 29,30 . The IL13 pathway is a potential therapeutic target for glycemic control in type 2 diabetes 31 . IL27 has been implicated in insulin resistance in genome wide association studies 32 . Wang et al. had reported the pathogenic role of IL27, using diabetic NOD mice to investigate T-cell mediated autoimmune diabetes 33 , but recently the role of IL27-IL27Rα in promoting adipocyte thermogenesis has been investigated in the context of treating insulin resistance 34 . Decreased plasma IL22 level was found to be a potential trigger of impaired fasting glucose and type 2 diabetes, in a retrospective study of Han Chinese subjects 35 . PDGFBB is reported as associated with type 2 diabetes mellitus and complications 36,37 . TGFA 38 and TGFB 39 have shown a pathologic contribution in diabetic kidney disease. TGFB is also reported associated with type 2 diabetic nephropathy 40 . ALKP has been investigated as an independent predictor for diabetes incidence 41,42 . MCP1 has been found significantly increased in patients with type 2 diabetes 43 . Furthermore, through rat studies GCSF has been reported as a potential novel therapeutic drug in early diabetic nephropathy patients 44 . The involvement of MIG (CXCL9) in the progression of type 2 diabetes nephropathy has been reported 45 . CD40-CD40L has been associated with type 2 diabetes mellitus 46 , VEGF is involved in the pathogenesis of diabetic complications 47 , c-Kit and its ligand, stem cell factor (SCF) have been reported as a potential novel target for treating diabetes 48 . Finally, chemotactic cytokines, including eosinophil chemotactic factors (ECFs), have been shown to be related to type 2 diabetes mellitus 49 . In our results, the cytokines above had high occurrence rates with statistically significant trends across the diabetic and prediabetic individuals. The findings suggest that our approach has potential application in clinical trials to identify disease biomakers and treatment targets.
In our analysis we used unsupervised methods to classify time-resolved trends in each subject, as well as to construct the network and identify the communities that showed differences in different diabetes-relevant measures (BMI, SSPG, DI and isrMax), Fig. 4 and Table 3. As the analysis did not depend on prior knowledge of the prediabetes/diabetes status of the subject, and given that the prediabetes-diabetes distinction is a rather continuous spectrum this suggests that potential pathophysiological signals are playing a part, and merits further future investigation.
Our study has limitations: In the current study we focused on immune changes and perturbations as these were available in the source data. As there are multiple other pathogenic factors that contribute to T2D, the data www.nature.com/scientificreports/ do not offer a complete picture and will need to be supplemented with additional followup studies. The data used in this investigation contained fine-grained omics data, and comparatively crude phenotypic data. Future investigations can address this at the outset of the experimental design prior to data collection. Furthermore, we have focused on immune markers, as these were the perturbation data available. To comprehensively study T2D more kinds of perturbations from baseline need to be investigated, including other disorders. Furthermore, the transcriptomic data used were generated from bulk RNA-sequencing, and hence do not allow for cell-type specific analyses, which are important is evaluating T2D. We expect that cell-type-specific studies with longitudinal data will become more prevalent with the recent focus on single-cell RNA-sequencing approaches. In terms of the data, there is uneven sampling (inherent in any real-world/subject based study), which we have addressed with our approach, as well as having different lengths of time series across individuals. Also, the RNA-seq data modality dominates in terms of number of omics. Our approach is robust as it analyzes single omics signals when assigning autocorrelation classes, irrespective of modality and hence is not affected by the unevenness between modality sizes. In the dendrogram constructions of determining groups and subgroups, it is possible that the trends are dominated by the large RNA-seq data, and potentially overfitting to a particular trend (i.e. missing subtle trend variation). Still, in this unsupervised clustering approach any distinct singleton omics patterns will be displayed. Furthermore, environmental measures are not included and the low number of participants does not allow for a nuanced analysis of heterogeneity across subjects. For example, different subjects received different treatments which are analyzed in bulk. In terms of broad applicability of data collection and analysis, the data are obtained using invasive approaches (at least for blood components), which should improve with non-invasive transcriptomic mapping, such as using saliva. While our analysis included microbiome data (nares and gut), their association with transcriptome results, the cytokines and other measurements noted in Table 2, and their mechanistic role still requires further model-based experimentation. Finally, time series analyses with multiomics are an evolving area of research. We currently do not have T2D datasets that can be used for validation of findings. We anticipate that more studies will be generated that will also provide the necessary data to enable validation beyond the discovery approach in this investigation. Such data will also allow us to evaluate different methodologies, as there are many approaches to time-series analysis that we intend to continue investigating. In this investigation we have used the spectral methods as a first approach to detect data patterns in an unsupervised analysis. The spectral methods aid in addressing missing data/uneven sampling while maintaining the statistical properties of the data, and provide a streamlined pattern recognition based on frequency/autocorrelation behavior which is often used in time series analyses. Future work can also employ different approaches, especially for longer time series and with higher sampling rates to aid in the development of differential equation models of departures from signal baselines in response to perturbations, that provide more mechanistic interpretations of T2D dynamics.
In this manuscript we performed an individual-focused analysis of a prediabetes study cohort 4,14 , which revealed insights from this longitudinal dataset and can lead to actionable health discoveries, providing relevant information for precision health monitoring. Our approach is the first, to our knowledge, with a temporal bottom-up/individual focus. Starting from individual microscopic measurements (molecular level omics), intra-subject immune responses can be characterized. Then, building on the individual responses, a macroscopic inter-subject temporal clustering of subjects based on temporal similarity provides information on how immune responses can be related to diabetic states, consistent with and supplementing the original work.
In summary, our findings utilize personal temporal omics to identify collective responses across individuals associated with macroscopic characteristics, and provide an approach that can potentially help predict disease responses and outcomes towards clinical implementations. The approach is non-disease specific, and extensible: any time signal measurement can be incorporated and any number of individuals can be compared spanning periods of individualized wellness and departures therefrom. Additionally, real-world application limitations of missing data and uneven sampling can be addressed. Beyond identifying trends within an individual, expanding to larger cohorts can eventually provide individual temporal signatures of specific disease onset. Such temporal disease signatures can be used to train models for monitoring departures from a healthy baseline, towards prevention or minimally timely treatments. Coupling multi-timepoint monitoring with non-invasive sampling (e.g. saliva 12 ) can help eventually provide affordable population-level precision health.

Data availability
The original data analyzed in this investigation were made publicly available by Zhou et al. 14 and Schussler-Fiorenza Rose et al. 4 as described therein, on https:// med. stanf ord. edu/ ipop. html. All data files used in this investigation, including original code and results files have been released on Github (https:// github. com/ gmias lab/ Tempo ralMu ltiom icsDi abetes), and also deposited on Zenodo (https:// doi. org/ 10. 5281/ zenodo. 67519 60), and are referred to as Online Data Files (ODFs) in the manuscript. www.nature.com/scientificreports/